switch transformer
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition
Gu, Zijin, Likhomanenko, Tatiana, Jaitly, Navdeep
Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
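The contrast between per-layer routers and a single shared router can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear top-1 router, the random initialization, and all names (`make_router`, `route`, the layer and expert counts) are hypothetical, and the hidden state is held fixed across layers purely to make the effect visible. In a real network the hidden state changes from layer to layer, so a shared router would be expected to yield correlated rather than identical choices.

```python
import random

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def make_router(num_experts, dim, rng):
    """Random linear router with one weight row per expert (hypothetical init)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]

def route(router, hidden):
    """Top-1 routing: index of the expert with the largest router logit."""
    logits = matvec(router, hidden)
    return max(range(len(logits)), key=lambda i: logits[i])

rng = random.Random(0)
num_layers, num_experts, dim = 4, 8, 16

# Switch-style baseline: every MoE layer owns an independent router.
independent_routers = [make_router(num_experts, dim, rng) for _ in range(num_layers)]

# Shared-router variant: one set of router weights reused by every MoE layer.
shared_router = make_router(num_experts, dim, rng)
shared_routers = [shared_router] * num_layers

# Simplification: the same hidden state is fed to every layer.
hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]
independent_choices = [route(r, hidden) for r in independent_routers]
shared_choices = [route(r, hidden) for r in shared_routers]

print("independent routers:", independent_choices)
print("shared router:      ", shared_choices)
```

With a fixed input the shared router necessarily selects the same expert index in every layer, whereas the independent routers have no such coupling.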
ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
Ai, Mengting, Wei, Tianxin, Chen, Yifan, Zeng, Zhichen, Zhao, Ritchie, Varatkar, Girish, Rouhani, Bita Darvish, Tang, Xianfeng, Tong, Hanghang, He, Jingrui
Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models.
The profound impact of the Transformer architecture in the domain of machine learning is undeniable, for the fields including natural language processing [3, 14, 18, 45, 48, 61] and computer vision [17, 39, 64], to name a few. To further improve the capabilities of pre-trained large language models (LLMs), one general strategy is to scale up their parameters. Mixture-of-Experts (MoE) [52] extends the traditional feedforward neural network (FFN) layer by replacing a single multilayer perceptron (MLP) with multiple MLPs, referred to as "experts". While enhancing the performance, sparse MoE keeps computing costs (FLOPs) comparable to the original dense model, as only a few selected experts will be activated each time. The framework of an MoE layer is demonstrated in ...
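The idea of storing one common expert plus compressed per-expert residuals can be sketched as below. This is an illustration under loud assumptions, not the ResMoE procedure itself: the element-wise mean stands in for the Wasserstein barycenter, magnitude pruning stands in for whatever residual approximation the paper uses, and every name and size (`mean_expert`, `prune_to_top_k`, `keep`) is hypothetical.

```python
import random

def mean_expert(experts):
    """Element-wise mean of the expert matrices (a stand-in for a barycenter)."""
    n = len(experts)
    rows, cols = len(experts[0]), len(experts[0][0])
    return [[sum(e[r][c] for e in experts) / n for c in range(cols)]
            for r in range(rows)]

def subtract(a, b):
    """Element-wise difference a - b of two matrices."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def prune_to_top_k(matrix, k):
    """Keep the k largest-magnitude entries as a sparse {(row, col): value} dict."""
    entries = [((r, c), v) for r, row in enumerate(matrix) for c, v in enumerate(row)]
    entries.sort(key=lambda item: abs(item[1]), reverse=True)
    return dict(entries[:k])

def restore(common, sparse_residual):
    """Rebuild an approximate expert: common expert plus its sparse residual."""
    restored = [row[:] for row in common]
    for (r, c), v in sparse_residual.items():
        restored[r][c] += v
    return restored

def max_abs_error(a, b):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

rng = random.Random(0)
rows, cols, num_experts = 4, 6, 3
# Toy experts that share most of their structure (an assumption of this sketch).
base = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
experts = [[[b + rng.gauss(0.0, 0.1) for b in row] for row in base]
           for _ in range(num_experts)]

common = mean_expert(experts)
keep = 8  # hypothetical per-expert budget of stored residual entries
sparse_residuals = [prune_to_top_k(subtract(e, common), keep) for e in experts]
restored = [restore(common, res) for res in sparse_residuals]
errors = [max_abs_error(e, r) for e, r in zip(experts, restored)]
print("max reconstruction error per expert:", errors)
```

Only `common` and the sparse residuals need to be stored; each expert is rebuilt on demand by `restore`. Keeping every residual entry recovers the original expert up to floating-point rounding.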
Tight Clusters Make Specialized Experts
Nielsen, Stefan K., Teo, Rachel S. Y., Abdullaev, Laziz U., Nguyen, Tan M.
Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.
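A toy version of routing in a per-expert rescaled feature space may help make the idea concrete. The sketch below is an assumption-laden illustration rather than the AC router as published: inverse-variance weights normalized to sum to one, centroid-distance scoring, and all names (`cluster_stats`, `route`, the toy clusters) are hypothetical choices made for this example.

```python
def cluster_stats(tokens, eps=1e-6):
    """Centroid and per-feature weights for one expert's cluster.

    Weights are normalized inverse variances, so a feature along which the
    cluster is tight receives a large weight (an assumed weighting scheme).
    """
    n = len(tokens)
    dim = len(tokens[0])
    centroid = [sum(t[d] for t in tokens) / n for d in range(dim)]
    variance = [sum((t[d] - centroid[d]) ** 2 for t in tokens) / n for d in range(dim)]
    inverse = [1.0 / (v + eps) for v in variance]
    total = sum(inverse)
    return centroid, [w / total for w in inverse]

def route(token, centroids, weights):
    """Pick the expert with the smallest weighted squared distance to the token."""
    def weighted_distance(k):
        return sum(w * (x - c) ** 2
                   for w, x, c in zip(weights[k], token, centroids[k]))
    return min(range(len(centroids)), key=weighted_distance)

# Expert 0 clusters tightly along feature 0; expert 1 clusters tightly along feature 1.
clusters = [
    [(0.0, -3.0), (0.1, 0.0), (-0.1, 3.0)],
    [(-3.0, 4.0), (0.0, 4.1), (3.0, 3.9)],
]
stats = [cluster_stats(c) for c in clusters]
centroids = [s[0] for s in stats]
adaptive_weights = [s[1] for s in stats]
uniform_weights = [[0.5, 0.5] for _ in clusters]

token = (0.1, 3.0)
plain_choice = route(token, centroids, uniform_weights)
adaptive_choice = route(token, centroids, adaptive_weights)
print("uniform weights ->", plain_choice, "| adaptive weights ->", adaptive_choice)
```

The token lies within expert 0's tight band along feature 0 but well outside expert 1's tight band along feature 1. Uniform weighting still assigns it to expert 1 because that centroid is closer in raw distance, while the adaptive weights assign it to expert 0.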
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Csordás, Róbert, Irie, Kazuki, Schmidhuber, Jürgen
How to reduce compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that unifies various methods to approximate two-layer NNs (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, our evaluation condition is parameter-equal, which is crucial to properly evaluate LMs. We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs. Our code is public.
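One way to read "approximating a two-layer network" is to partition the hidden units of a feedforward block into groups (experts) and evaluate only some of them. The sketch below illustrates that reading only; it is not the framework from the paper. The ReLU activation, the contiguous grouping, and the names `dense_ffn`, `moe_ffn`, `expert_size`, and `active_experts` are assumptions, and the choice of active experts is passed in by hand instead of being produced by a learned router.

```python
import random

def dense_ffn(x, w1, w2):
    """y = W2 relu(W1 x), with W1 of shape (d_ff, d_model) and W2 of shape (d_model, d_ff)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def moe_ffn(x, w1, w2, expert_size, active_experts):
    """Same computation restricted to the hidden units of the selected experts.

    Expert e owns hidden units [e * expert_size, (e + 1) * expert_size).
    """
    y = [0.0] * len(w2)
    for e in active_experts:
        for j in range(e * expert_size, (e + 1) * expert_size):
            h = max(0.0, sum(w * xi for w, xi in zip(w1[j], x)))
            for i in range(len(w2)):
                y[i] += w2[i][j] * h
    return y

rng = random.Random(0)
d_model, d_ff, expert_size = 4, 8, 2
num_experts = d_ff // expert_size
w1 = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_ff)]
w2 = [[rng.gauss(0.0, 1.0) for _ in range(d_ff)] for _ in range(d_model)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

exact = dense_ffn(x, w1, w2)
all_active = moe_ffn(x, w1, w2, expert_size, range(num_experts))
two_active = moe_ffn(x, w1, w2, expert_size, [0, 1])  # hand-picked subset
print("dense:      ", exact)
print("all experts:", all_active)
print("two experts:", two_active)
```

With every expert active the result matches the dense block up to floating-point rounding; with a subset active, the output is an approximation whose cost scales with the number of experts evaluated.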
Emergent Modularity in Pre-trained Transformers
Zhang, Zhengyan, Zeng, Zhiyuan, Lin, Yankai, Xiao, Chaojun, Wang, Xiaozhi, Han, Xu, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, Zhou, Jie
This work examines the presence of modularity in pre-trained Transformers, a feature commonly found in human brains and thought to be vital for general intelligence. In analogy to human brains, we consider two main characteristics of modularity: (1) functional specialization of neurons: we evaluate whether each neuron is mainly specialized in a certain function, and find that the answer is yes; (2) function-based neuron grouping: we explore finding a structure that groups neurons into modules by function, such that each module works for its corresponding function. Given the enormous number of possible structures, we focus on Mixture-of-Experts as a promising candidate, which partitions neurons into experts and usually activates different experts for different inputs. Experimental results show that there are functional experts, in which the neurons specialized in a certain function are clustered. Moreover, perturbing the activations of functional experts significantly affects the corresponding function. Finally, we study how modularity emerges during pre-training, and find that the modular structure is stabilized at the early stage, which is faster than neuron stabilization. This suggests that Transformers first construct the modular structure and then learn fine-grained neuron functions. Our code and data are available at https://github.com/THUNLP/modularity-analysis.
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Du, Zhixu, Li, Shiyu, Wu, Yuhao, Jiang, Xiangyu, Sun, Jingwei, Zheng, Qilin, Wu, Yongkai, Li, Ang, Li, Hai "Helen", Chen, Yiran
Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA (Sparsity-inspired Data-Aware), an efficient inference approach tailored for large MoE models. SiDA judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA attains a remarkable speedup in MoE inference, with up to a 3.93X throughput increase, up to a 75% latency reduction, and up to 80% GPU memory savings, with as little as a 1% performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even in memory-constrained systems.
Condensing Multilingual Knowledge with Lightweight Language-Specific Modules
Xu, Haoran, Tan, Weiting, Li, Shuyue Stella, Chen, Yunmo, Van Durme, Benjamin, Koehn, Philipp, Murray, Kenton
Incorporating language-specific (LS) modules is a proven method to boost performance in multilingual machine translation. This approach bears similarity to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, scaling this approach to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by full-rank matrices in fully-connected layers. In this work, we introduce the Language-Specific Matrix Synthesis (LMS) method. This approach constructs LS modules by generating low-rank matrices from two significantly smaller matrices to approximate the full-rank matrix. Furthermore, we condense multilingual knowledge from multiple LS modules into a single shared module with the Fuse Distillation (FD) technique to improve the efficiency of inference and model serialization. We show that our LMS method significantly outperforms previous LS methods and MoE methods with the same amount of extra parameters, e.g., by 1.73 BLEU points over the Switch Transformer on many-to-many multilingual machine translation. Importantly, LMS achieves comparable translation performance with far fewer parameters.
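The parameter saving from building each language-specific matrix out of two smaller factors can be checked with simple arithmetic. The sketch below is illustrative only: the layer sizes, rank, and language count are hypothetical values chosen for the example, not figures from the paper, and the helper names are not from any library.

```python
def full_rank_params(d_in, d_out, num_languages):
    """Parameters when every language owns a full d_in x d_out matrix."""
    return num_languages * d_in * d_out

def low_rank_params(d_in, d_out, rank, num_languages):
    """Parameters when every language owns factors A (d_in x rank) and B (rank x d_out)."""
    return num_languages * rank * (d_in + d_out)

def apply_low_rank(x, a, b):
    """Compute x @ A @ B without materializing the d_in x d_out product."""
    hidden = [sum(xi * a[i][j] for i, xi in enumerate(x)) for j in range(len(a[0]))]
    return [sum(h * b[j][k] for j, h in enumerate(hidden)) for k in range(len(b[0]))]

# Hypothetical sizes, chosen only to illustrate the scaling.
d_in, d_out, rank, num_languages = 1024, 4096, 16, 100
full = full_rank_params(d_in, d_out, num_languages)
low = low_rank_params(d_in, d_out, rank, num_languages)
print("full-rank parameters:", full)
print("low-rank parameters: ", low)
print("reduction factor:    ", full / low)

# Tiny forward pass: a rank-1 module mapping 2 inputs to 3 outputs.
output = apply_low_rank([1.0, 2.0], [[1.0], [1.0]], [[1.0, 2.0, 3.0]])
print("low-rank output:", output)
```

The full-rank count grows with the product `d_in * d_out`, while the low-rank count grows with the sum `d_in + d_out` scaled by the rank, which is why the gap widens as layers get larger.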
Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity
Xu, Haoran, Elbayad, Maha, Murray, Kenton, Maillard, Jean, Goswami, Vedanuj
Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the improvement in performance diminishes with an increasing number of experts. We hypothesize this parameter inefficiency is a result of all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks. In light of this, we propose Stratified Mixture of Experts (SMoE) models, which feature a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on three multilingual machine translation benchmarks, containing 4, 15, and 94 language pairs, respectively. We show that SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.